Advanced Search
What we did

The user can search in every field that the XML file provides, as well as in some extra fields that we created based on our analysis of the XML document (see: Indexing the Collection). This means the user can search in:
* The title of the letter.
* The introduction of the letter.
* The questions of the letter.
* The answers of the letter.
* All parts of the letter.

Also, the user can specify:
* The date of the letter (using a date range).
* The author of the letter (see Frequency Table).

On top of that, the XML files provide 'entities': links to Wikipedia pages for certain terms. We indexed these as well and gave the user the option to tag words in the description.

To address the needs of professionals, we also made advanced search possible. This means that for every field (title, introduction, questions, answers and all parts), the user can use:
* Wildcards (e.g. isla* for 'islam', 'islamiet', 'islamofobie' and 'islamisering').
* Regular expressions.
* Fuzziness (e.g. Telergaaf~).
* Proximity searches (e.g. "Stichting Amsterdam"~2).
* Phrase queries.
* Boosting (to make a term more relevant than the others).
* Boolean operators.
* Grouping (e.g. stichting AND (Pantar OR Paardenopvang)).

Please note: if the query for one field consists of multiple terms and the user put no Boolean operators between those terms, the documents that contain all (or most) of the query terms get the highest ranking, but we also return documents that contain only one (or a few) of the query terms.

What works well

We're really proud of how well our indexer recognizes introductions, questions and answers. Searching in these specific fields works really well. We also believe that our search engine is very easy to understand, while still offering a lot of features for professional users. If you wish, you can just search in all the fields and specify a date range on the left. However, it's also possible to search in specific fields and use advanced operators in your query, like wildcards, Boolean operators, proximity searches, boosts and groups.

What has to be improved

The Wikipedia tagging isn't very accurate. Strictly speaking, this is not our "fault", since these tags are specified in the XML files that were given to us. However, if you enable this tagging, you'll notice that the Wikipedia page matched to a word often isn't the correct one. We tried to improve this as much as we could, e.g. by preventing stop words (like 'de' and 'het') from being tagged, since these tags were most often wrong. However, we couldn't do more than that. Because the word tagging is still quite hit-or-miss, we disabled it by default. Users who want it can enable it in the Advanced Search box.

Evaluation of quality

We believe we've created an impressive number of advanced search options in the time that was available for this assignment. Not only can the user search in many fields (even ones that weren't specified in the XML document, but were derived by the indexer), it's also possible to use advanced operators. The only thing that doesn't work well is the Wikipedia tagging, which is why we disabled it by default. Overall, we think our advanced search options are very strong.
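
To make the ranking note from "What we did" concrete, here is a minimal Python sketch of how a query without Boolean operators could be scored: documents that match more query terms rank higher, and documents that match only some terms are still returned. It also handles the wildcard (isla*) and fuzzy (Telergaaf~) notations in a simplified way. This is not our actual implementation; the function names, the sample documents and the fuzzy threshold of two edits are all assumptions made for illustration.

```python
import fnmatch
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def term_matches(term, tokens, max_edits=2):
    """Check one query term against the tokens of a document.

    'term~' is treated as a fuzzy term (max_edits is an assumed threshold),
    a term containing '*' or '?' as a wildcard, anything else as exact.
    """
    term = term.lower()
    if term.endswith("~"):
        target = term[:-1]
        return any(edit_distance(target, t) <= max_edits for t in tokens)
    if "*" in term or "?" in term:
        return any(fnmatch.fnmatchcase(t, term) for t in tokens)
    return term in tokens


def rank(query, docs):
    """Return ids of documents matching at least one term, best match first."""
    terms = query.split()
    scored = []
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        hits = sum(term_matches(t, tokens) for t in terms)
        if hits:  # partial matches are kept, they just rank lower
            scored.append((hits, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical sample documents, not taken from the collection.
docs = {
    "a": "De Telegraaf schrijft over islamofobie",
    "b": "Brief over de islam",
    "c": "Stichting Pantar in Amsterdam",
}
print(rank("isla* Telergaaf~", docs))
```

In this sketch, document "a" matches both terms and is listed first, "b" matches only the wildcard and is listed second, and "c" matches nothing and is left out. A real search engine would also weigh term frequency and boosts, which are omitted here.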